 negative curvature


simple-saddle-camera-version

Neural Information Processing Systems

Escaping saddle points is a central research topic in nonconvex optimization. In this paper, we propose a simple gradient-based algorithm such that for a smooth function $f\colon \mathbb{R}^n \to \mathbb{R}$, it outputs an $\epsilon$-approximate second-order stationary point in $O(\log n/\epsilon^{1.75})$ iterations. Compared to the previous state-of-the-art algorithms by Jin et al. with $O(\log^4 n/\epsilon^{2})$ or $O(\log^6 n/\epsilon^{1.75})$ iterations, our algorithm is polynomially better in terms of $\log n$ and matches their complexities in terms of $1/\epsilon$.


Adaptive Negative Curvature Descent with Applications in Non-convex Optimization

Neural Information Processing Systems

The negative curvature descent (NCD) method has been used to design deterministic and stochastic algorithms for non-convex optimization that aim to find second-order stationary points or local minima. In existing studies, NCD needs to approximate the smallest eigenvalue of the Hessian matrix with sufficient precision (e.g., $\epsilon_2\ll 1$) in order to achieve a sufficiently accurate second-order stationary solution (i.e., $\lambda_{\min}(\nabla^2 f(\mathbf{x}))\geq -\epsilon_2$). One issue with this approach is that the target precision $\epsilon_2$ is usually set to be very small in order to find a high-quality solution, which increases the complexity of computing a negative curvature. To address this issue, we propose an adaptive NCD that allows an adaptive error, dependent on the current gradient's magnitude, in approximating the smallest eigenvalue of the Hessian, and that encourages competition between a noisy NCD step and a gradient descent step. We consider applications of the proposed adaptive NCD to both deterministic and stochastic non-convex optimization, and demonstrate that it can help reduce the overall complexity of computing the negative curvatures during the course of optimization without sacrificing the iteration complexity.
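
The competition between an NCD step and a gradient descent step can be illustrated with a minimal sketch. It is not the paper's algorithm: it uses an exact eigendecomposition instead of an approximate, adaptive-precision eigenvalue solver, and the selection rule (pick the step with the larger guaranteed decrease under standard smoothness bounds) is an assumption made for illustration. The function name `gd_or_ncd_step`, the constants `L1` (gradient Lipschitz) and `L2` (Hessian Lipschitz), and the test function are all hypothetical.

```python
import numpy as np

def gd_or_ncd_step(x, grad_fn, hess_fn, L1, L2):
    """Take whichever step has the larger guaranteed decrease (illustrative rule)."""
    g = grad_fn(x)
    # Exact eigendecomposition; a real NCD method would only approximate this.
    eigvals, eigvecs = np.linalg.eigh(hess_fn(x))
    lam_min, v = eigvals[0], eigvecs[:, 0]

    # Standard bounds: GD decreases f by at least ||g||^2 / (2 L1);
    # an NCD step of length 2|lam_min|/L2 decreases f by at least 2|lam_min|^3 / (3 L2^2).
    gd_gain = np.dot(g, g) / (2.0 * L1)
    ncd_gain = 2.0 * abs(lam_min) ** 3 / (3.0 * L2 ** 2) if lam_min < 0 else 0.0

    if ncd_gain > gd_gain:
        # Orient v so that it is not an ascent direction for the gradient.
        sign = 1.0 if np.dot(v, g) <= 0 else -1.0
        return x + sign * (2.0 * abs(lam_min) / L2) * v, "ncd"
    return x - g / L1, "gd"

# Hypothetical test function with a strict saddle at the origin.
def f(x):
    return 0.5 * x[0] ** 2 - 0.5 * x[1] ** 2 + 0.25 * x[1] ** 4

def grad_f(x):
    return np.array([x[0], -x[1] + x[1] ** 3])

def hess_f(x):
    return np.diag([1.0, -1.0 + 3.0 * x[1] ** 2])

# L1 = 2 and L2 = 6 are assumed local constants, valid for |x[1]| <= 1.
x_saddle = np.zeros(2)
x_next, kind = gd_or_ncd_step(x_saddle, grad_f, hess_f, L1=2.0, L2=6.0)
```

At the saddle the gradient vanishes, so plain gradient descent would stall; the negative curvature direction is the only source of progress there, while away from the saddle the gradient step wins the comparison.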



Zeroth-Order Negative Curvature Finding: Escaping Saddle Points without Gradients

Neural Information Processing Systems

Several classical results have shown that, for $\rho$-Hessian Lipschitz functions (see Definition 1), using second-order information such as computing the Hessian [33] or Hessian-vector products [1, 9, 2], one can find an $\epsilon$-approximate second-order stationary point (SOSP, $\|\nabla f(x)\| \le \epsilon$ and $\nabla^2 f(x) \succeq -\sqrt{\rho\epsilon}\, I$).
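
As a small illustration of the SOSP condition, the sketch below checks it numerically when the gradient and Hessian are available. It assumes the common convention that a point is an $\epsilon$-approximate SOSP when the gradient norm is at most $\epsilon$ and the smallest Hessian eigenvalue is at least $-\sqrt{\rho\epsilon}$; the helper name `is_approx_sosp` and the example matrices are hypothetical.

```python
import numpy as np

def is_approx_sosp(grad, hess, eps, rho):
    """Check ||grad|| <= eps and lambda_min(hess) >= -sqrt(rho * eps)."""
    grad_ok = np.linalg.norm(grad) <= eps
    lam_min = np.linalg.eigvalsh(hess)[0]  # eigvalsh returns ascending eigenvalues
    return bool(grad_ok and lam_min >= -np.sqrt(rho * eps))

# x^2 - y^2 at the origin: zero gradient, but Hessian diag(2, -2) (a saddle).
saddle_grad = np.zeros(2)
saddle_hess = np.diag([2.0, -2.0])

# x^2 + y^2 at the origin: zero gradient and Hessian diag(2, 2) (a minimum).
min_hess = np.diag([2.0, 2.0])

at_saddle = is_approx_sosp(saddle_grad, saddle_hess, eps=1e-2, rho=1.0)
at_minimum = is_approx_sosp(saddle_grad, min_hess, eps=1e-2, rho=1.0)
```

The saddle passes the first-order test but fails the curvature test, which is exactly the case a first-order stationarity check cannot distinguish from a minimum.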



Appendices

Neural Information Processing Systems

The Hessian of $f(Z)$ can be viewed as a $KN \times KN$ matrix by vectorizing the matrix $Z$. For deeper linear networks, it can be shown that flat saddle points exist at the origin, but there are no spurious local minima [34,37]. While most of these results based on the bottom-up approach explain optimization and generalization of certain types of deep neural networks, they provided limited insights into the practice of deep learning. In fact, our proof techniques are inspired by recent results on low-rank matrix recovery [77,80]. Some of the metrics are similar to those presented in [1]. Figure 7 depicts the learning curves in terms of both the training and test accuracy for all three optimization algorithms (i.e., SGD, Adam, and LBFGS).